CuMAPz: Analyzing the Efficiency of Memory Access Pattern in CUDA
Authors
Abstract
Even though the entry barrier to writing a GPGPU program has been lowered with the help of many high-level programming models, such as NVIDIA CUDA, it is still very difficult to optimize a program so as to fully utilize the given architecture’s performance. The burden on GPGPU programmers keeps growing as they have to consider many parameters, especially those related to the memory access pattern, and even a small change in those parameters can lead to a drastic performance change that is not obvious, or often counterintuitive, before careful analysis. In this paper, we focus on optimizing a CUDA program using shared memory. We present a tool that analyzes the efficiency of given memory-access-pattern parameters. Given a set of parameters, the tool analyzes data reuse, global memory access coalescing, shared memory bank conflicts, partition camping, and branch divergence. The output of the tool is profitability, a comprehensive performance metric introduced in this paper. Profitability can be used to compare the efficiency of different sets of parameters without even writing a program. Experimental results show that profitability accurately predicts how the performance of a program changes as the memory-access-pattern-related parameters are varied.
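As a toy illustration of two of the checks such an analysis performs (not CuMAPz's actual implementation), the sketch below scores a warp's memory addresses on the two axes the abstract names: the number of 128-byte global-memory segments touched (1 means fully coalesced) and the worst-case number of threads mapping to the same shared-memory bank (1 means conflict-free). The warp size, segment size, and 32-bank/4-byte-word layout are assumptions matching common CUDA GPUs:

```python
from collections import Counter

WARP_SIZE = 32
SEGMENT_BYTES = 128   # assumed global-memory coalescing granularity
NUM_BANKS = 32        # assumed shared-memory bank count
WORD_BYTES = 4        # assumed bank word width

def global_segments(addresses):
    """Count the 128-byte segments a warp's accesses touch; 1 = fully coalesced."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

def max_bank_conflict(addresses):
    """Worst-case number of threads hitting one shared-memory bank; 1 = conflict-free."""
    per_bank = Counter((addr // WORD_BYTES) % NUM_BANKS for addr in addresses)
    return max(per_bank.values())

# Unit-stride pattern: one segment, each thread in its own bank.
unit = [tid * WORD_BYTES for tid in range(WARP_SIZE)]
print(global_segments(unit), max_bank_conflict(unit))        # 1 1

# Stride-32 pattern: 32 segments, all threads serialized on one bank.
strided = [tid * 32 * WORD_BYTES for tid in range(WARP_SIZE)]
print(global_segments(strided), max_bank_conflict(strided))  # 32 32
```

The two example patterns show why a small parameter change (here, an access stride) can swing both metrics from best case to worst case at once, which is the kind of non-obvious interaction the tool is meant to expose before any code is written.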
Similar Papers
Performance evaluation of GPU memory hierarchy using the FFT
Modern GPUs (Graphics Processing Units) are becoming more relevant in the world of HPC (High Performance Computing) thanks to their large computing power and relatively low cost; however, their special architecture results in more complex programming. To take advantage of their computing resources and develop efficient implementations, it is essential to have certain knowledge about the architecture a...
Full text

Generating GPU Code from a High-Level Representation for Image Processing Kernels
We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allows the programmer to specify both the execution constraints and the memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent ...
Full text

Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach
There are several different methods for building an efficient strategy for steganalysis of digital images. A very powerful method in this area is the rich model, consisting of a large number of diverse sub-models in both the spatial and transform domains, that should be utilized. However, the extraction of various types of features from an image is very time consuming in some steps, especially for the training pha...
Full text

A new approach to the lattice Boltzmann method for graphics processing units
Emerging many-core processors, like CUDA-capable NVIDIA GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since global memory on graphics devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performance. Whenever possible, global memory loads and stores should be coalesced and a...
Full text

AutoGPU: Automatic Generation of CUDA Kernel Code
Manual optimization of a CUDA kernel can be an arduous task, even for the simplest of kernels. The CUDA programming model is such that high performance may only be achieved if memory accesses in the kernel follow certain patterns; further, fine-tuning of the kernel execution and loop configuration may result in a dramatic increase in performance. The number of possible such configurations mak...
Full text